[VanillaNet: the Power of Minimalism in Deep Learning](https://arxiv.org/abs/2305.12972)
First some very introductory notes for the newbies. Since this paper is fairly short and not nearly as difficult to read as the last one, we may as well go deep on the topics it does cover.
Btw, I first heard of this paper from (the firehose that is) marktechpost.
ImageNet
The ImageNet large scale visual recognition challenge was an annual object classification challenge that propelled much of the early development in Geometric Deep Learning. ImageNet requires models to classify realistic images scraped from the Web into one of 1000 categories: such categories are at the same time diverse (covering both animate and inanimate objects), and specific (with many classes focused on distinguishing various cat and dog breeds). Hence, good performance on ImageNet often implies a solid level of feature extraction from general photographs, which formed a foundation for various transfer learning setups from pre-trained ImageNet models.
–from “Geometric Deep Learning” by Bronstein et al. (2021)
Why CNNs?
In a 2-dimensional convolutional layer, A will look at patches. For each patch, A will compute features. For example, it might learn to detect the presence of an edge. Or it might learn to detect a texture. Or perhaps a contrast between two colors.
–from colah’s blog, an intro with great diagrams on Conv Nets
The math background for convolutions.
Convolutional neural nets use discrete convolutions on a grid (but they can be extended to an arbitrary graph).
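To make “discrete convolution on a grid” concrete, here is a toy numpy version of the valid-mode 2D operation that deep-learning libraries call convolution (technically cross-correlation), with a hand-made vertical-edge-detecting kernel; all names and values here are illustrative, not from the paper:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation (what DL frameworks call 'convolution')."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # each output value is a dot product between the kernel
            # and the image patch it currently covers
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# a vertical-edge detector applied to an image with an edge down the middle
img = np.array([[0, 0, 1, 1]] * 4, dtype=float)
edge = np.array([[-1.0, 1.0]])
response = conv2d(img, edge)  # fires only at the column where 0 jumps to 1
```

The same sliding-dot-product structure is what generalizes to graphs: replace “rectangular patch” with “neighborhood of a node”.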
Inductive Bias: Translational Invariance1
The no-free-lunch theorem for machine learning (Wolpert et al., 1997; Baxter, 2000) basically says that some set of preferences (or inductive bias) over the space of all functions is necessary to obtain generalization, that there is no completely general-purpose learning algorithm, that any learning algorithm will generalize better on some distributions and worse on others. Typically, given a particular dataset and loss function, there are many possible solutions (e.g. parameter assignments) to the learning problem that exhibit equally “good” performance on the training points. Given a finite training set, the only way to generalize to new input configurations is then to rely on some assumptions or preferences about the solution we are looking for. An important question for AI research aiming at human-level performance then is to identify inductive biases that are most relevant to the human perspective on the world around us. Inductive biases, broadly speaking, encourage the learning algorithm to prioritise solutions with certain properties.
–from “Inductive Biases for Deep Learning of Higher-Level Cognition” by Goyal & Bengio (2020)
For example, a convolutional feature can learn to detect an eye no matter where it is in the image using the same “eye” detector translated across the image, firing when it covers an eye.
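That translation property is easy to check numerically. A toy 1-D sketch (the “eye” is just a spike and the detector a made-up kernel, purely for illustration):

```python
import numpy as np

def conv1d_valid(x, w):
    """1-D valid cross-correlation: slide detector w over signal x."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i+len(w)], w) for i in range(n)])

detector = np.array([1.0, 2.0, 1.0])   # stand-in "eye" detector
x = np.zeros(10); x[3] = 1.0           # "eye" at position 3
x_shifted = np.roll(x, 2)              # same "eye", moved to position 5

y = conv1d_valid(x, detector)
y_shifted = conv1d_valid(x_shifted, detector)
# equivariance: shifting the input shifts the response by the same amount
assert np.allclose(np.roll(y, 2), y_shifted)
# aggregating over positions (e.g. a max) then gives invariance
assert y.max() == y_shifted.max()
```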
Vanishing Gradients
The vanishing gradient problem is a common issue that arises when training deep neural networks with many layers of neurons. It occurs when the gradients of the error function become so small that they approach zero, making it difficult for the network to learn from the available data. This can lead to poor performance and slow convergence during training.
There are several reasons why this might happen:
- Excessive layer depth: As the number of layers increases, the gradients have to pass through more nodes before reaching the output layer. With each node, there is a chance for the gradients to be shrunk or even cancelled out entirely.
- Nonlinear activation functions: Activation functions such as ReLU and sigmoid are used to introduce nonlinearity into the network. However, these functions also reduce the magnitude of the gradients passing through them.
- Initialization issues: Improper initialization of the weights and biases can cause some units to become stuck in local minima or maxima, which leads to difficulty in updating their parameters.
- Lack of regularization: Regularization techniques like dropout and weight constraints help prevent overfitting by reducing the capacity of the model. When not properly applied, this can result in flat regions in the loss landscape where gradients disappear.
To address the vanishing gradients problem, there are various approaches that can be taken. Some of these include:
- Reduce layer depth: One way to mitigate the effect of excessive layer depth is to use fewer layers, which will make it easier for gradients to flow through the network.
- Use better activations: Certain activation functions are less prone to causing gradient decay, such as ReLU variants, softplus, and ELUs. Additionally, architectural choices like residual connections and skip connections can improve the flow of information throughout the network.
- Initialize carefully: Properly initializing the weights and biases can significantly impact the success of your model. Techniques such as Xavier/Glorot initialization and Kaiming/He initialization can help ensure that your model is well-behaved and easy to train.
- Apply regularization: Regularization techniques such as weight decay, Dropout, Batch Normalization, and Early Stopping can help control the model’s complexity and avoid overfitting. These methods encourage the network to generalize better and prevent the formation of bad local optima.
- Change learning rate schedule: Increasing the learning rate at the beginning of training can help overcome the vanishing gradients problem, but doing so too aggressively can lead to exploding gradients in the opposite direction. Using a cosine annealing schedule or Adaptive Learning Rates can help smooth out the learning process and achieve better results.
–from Guanaco when asked to explain the vanishing gradients problem
VanillaNet chose to circumvent the issue by limiting the depth of their model.
Complex Model Additions (not used in VanillaNet)
Shortcuts, or Skip Connections
Residual blocks combine the output of a non-linear layer (dx) with a skip connection (x), so the block learns only the “residual”, the (small?) change required, rather than the full output (x + dx). Without skip connections, information gets lost as it travels from the bottom layers to the top layers of the network. Skip connections allow the network to learn features at various levels of abstraction and combine them to create more accurate predictions.
The skip connections can be done in various ways. The important aspect is bringing the signal across layers unchanged so that the intermediate layers only need to learn the change from that signal – this works very well if the “residuals” are simpler than the full signal.
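As a minimal sketch (toy numpy, a single block with made-up shapes, not ResNet’s actual layer layout):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = x + f(x): the layers only learn the change ('residual') dx."""
    dx = W2 @ relu(W1 @ x)   # the non-linear branch
    return x + dx            # the skip connection carries x across unchanged

# if the residual branch has near-zero weights, the block is near-identity,
# so the signal still reaches later layers even before training has done much
x = np.ones(4)
W1 = np.zeros((4, 4)); W2 = np.zeros((4, 4))
assert np.allclose(residual_block(x, W1, W2), x)
```

This near-identity behavior at initialization is exactly why the signal is “brought across layers unchanged”.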
Self-Attention
Self-attention is a mechanism that lets a model focus on the most relevant parts of its input (an image, a sentence, …). It involves a network of neural connections that process information in a non-linear way, allowing it to identify patterns and features that are relevant to the task at hand. It was first introduced in “Attention Is All You Need” [Vaswani et al., 2017], which also introduced the first Transformer model. More generally, attention was first introduced in NLP for RNN models in the paper “Neural Machine Translation by Jointly Learning to Align and Translate” [Bahdanau et al., 2014].
Learn them more deeply by coding self attention from scratch.
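If you’d rather skim code first, here is a bare-bones single-head version in numpy (no masking, no multi-head, random weights purely for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how much each position attends to each other
    weights = softmax(scores)                # each row sums to 1
    return weights @ V                       # each output is a weighted mix of the values

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))              # 5 tokens, model dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
assert out.shape == (5, 8)
```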
Training Tricks (used by VanillaNet)
Under this situation, we find that the inference speed has little relationship with the number of FLOPs and parameters. Taking MobileNetV3-Large as an example, though it has a very low FLOPs (0.22B), its GPU latency is 7.83, which is even larger than our VanillaNet-13 with 11.9B FLOPs. In fact, the inference speed in this setting is highly related to the complexity and number of layers.
–from section 4.3 of VanillaNet
VanillaNet gets very high top-1 accuracy with low latency on single sample inference (as might be the case for the model used in an app). They claim that even when the parameter counts of models with residual connections and attention heads are lower, their model is still faster. This seems worth verifying.
The accuracy they achieve is credited to their training regime, which is full of “tricks”.
To train our proposed VanillaNets, we conduct a comprehensive analysis of the challenges associated with their simplified architectures and devise a “deep training” strategy. This approach starts with several layers containing non-linear activation functions. As the training proceeds, we progressively eliminate these non-linear layers, allowing for easy merging while preserving inference speed. To augment the networks’ non-linearity, we put forward an efficient, series-based activation function incorporating multiple learnable affine transformations. Applying these techniques has been demonstrated to significantly boost the performance of less complex neural networks.
–from section 1 of VanillaNet
Better late than never (after the meetup, alas), here are some notes about the activation series and the 2→1 layer merging. What allows the final merging of the two layers is that the final model has been trained to have no activation between them. So every second activation layer is diminished across epochs, and, in their own words:
Therefore, at the beginning of training (e = 0), A′(x) = A(x), which means the network has strong non-linearity. When the training has converged, we have A′(x) = x, which means the two convolutional layers have no activation function in the middle.
– following Equation (1)
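A minimal sketch of that schedule (the linear ramp `lam = epoch / total_epochs` is my assumption for illustration; the paper defines its own schedule for λ around Equation (1)):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def deep_training_activation(x, epoch, total_epochs):
    """A'(x) = (1 - lam) * A(x) + lam * x, with lam ramping 0 -> 1 over training.

    Early in training the layer is fully non-linear; by the end it is the
    identity, so the two conv layers around it can be merged for inference.
    """
    lam = epoch / total_epochs
    return (1.0 - lam) * relu(x) + lam * x

x = np.array([-2.0, -1.0, 0.5, 2.0])
assert np.allclose(deep_training_activation(x, 0, 100), relu(x))  # e = 0: A'(x) = A(x)
assert np.allclose(deep_training_activation(x, 100, 100), x)      # converged: A'(x) = x
```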
To really make things spicy, they don’t use just a regular single activation but a series of activations, each with its own scale and offset. In the meeting, to answer why this was more non-linear, I resorted to arm gestures that Pablo eloquently extrapolated to the YMCA dance.
With 1, 2 and 3 ReLUs offset with opposing weights, we can divide the input into non-contributing (zero), reinforcing (+ve) and opposing (-ve) regions of influence!
Having a portion of the training with such high non-linearity allows the model parameters to explore values they couldn’t otherwise reach with regular single-ReLU activations. Much like annealing (in materials and in neural network training): the training begins with strong non-linearities that are gradually reduced away, and the final model finds a better minimum than it otherwise could.
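A toy numpy sketch of that region-splitting (the scales and shifts below are made-up illustrations, not the paper’s learned parameters):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def series_activation(x, scales, shifts):
    """Sum of scaled, shifted ReLUs: a piecewise-linear function whose slope
    can change (and even flip sign) at each shift."""
    return sum(a * relu(x - b) for a, b in zip(scales, shifts))

# one ReLU reinforcing, one opposing with double weight:
#   x < 0 : non-contributing (zero)
#   0..1  : reinforcing (slope +1)
#   x > 1 : opposing (slope -1)
y_neg = series_activation(np.array([-1.0]), scales=[1.0, -2.0], shifts=[0.0, 1.0])
y_mid = series_activation(np.array([0.5]),  scales=[1.0, -2.0], shifts=[0.0, 1.0])
y_pos = series_activation(np.array([3.0]),  scales=[1.0, -2.0], shifts=[0.0, 1.0])
```

A single ReLU only ever gives one kink; the series gives a kink per term, which is the extra non-linearity being added.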
Bruce’s comments
General comments
The authors may have done something interesting, but they claim much more than they provide objective evidence for. They also make sweeping self-congratulatory subjective assessments that put me in a bad mood.
The general idea
We live in a world where it seems to be all-transformers-all-the-time for language models and nothing-but-resnets for image tasks. Lots of people find it hard to wrap their heads around transformers, so avoiding them might seem appealing. (But really, they are not that hard to understand.) The authors are really allergic to skip connections for some reason, so they would like to not use ResNets.
They focus on image tasks (classification, object detection, segmentation) where convolutional neural networks (CNNs), like ResNets, have traditionally been strong and are still widely used. The best models are very deep (i.e., have lots of layers).
Their approach is to make several tradeoff decisions that are different from the usual ones.
- They go wide instead of deep.
- They add some training complexity in exchange for benefits at inference time.
- They add complexity to the activation functions instead of relying on depth to provide non-linearity.
- They do not use skip connections (like ResNet) or self-attention (like Transformers). Just plain old (vanilla!) CNN layers.
Their main claim to simplicity is the last of these four.
How can something so simple still work?
Because it’s not that simple, really. They avoid some commonly used architectural features, but they add complexity elsewhere.
- They end up with more, not fewer, parameters and it takes more FLOPs to run.
- They use a “series activation function” with several components rather than a single function. (This seems novel and may be useful in other contexts.)
- They use a “deep training technique” that starts with a deeper model but ends up shallow.
- Their model has a larger activation region—i.e., it sees more of the input data. (This is not convincingly demonstrated.)
Why is it better?
The authors make the following claims. Many are backed by theoretical arguments rather than empirical evidence.
- faster (they do demonstrate this)
- easier to deploy (vague statement—no evidence provided)
- smaller memory requirements, something about off-chip memory traffic (not vague, but no evidence provided)
- better interpretability (vague statement—no evidence provided)
- improved flexibility (vague statement—no evidence provided)
- strong baselines (it does achieve comparable performance to deeper architectures)
If you find some of these claims surprising given the “why does it work” reasons above, I’m with you.
I am not the best person to assess the theoretical arguments that are made. I look forward to hearing from others who are closer to implementation issues.
Ray’s Notes
Code dive into VanillaNets
A quick code dive into VanillaNet: https://www.youtube.com/watch?v=o8pJcvL8Lw8 (2x-speed watching recommended; it was just uploaded, so the YouTube transcription isn’t ready yet).
The notebook is on Kaggle, which provides a free GPU, etc.
I like this paper a lot!
Especially the “train big, deploy small” part.
First of all, it’s very useful not only for production but also for Kaggle competitions, where stacking many CNNs at inference time is common practice for every team.
I didn’t pay too much attention to every detail of its assortment of performance metrics. (Who really cares about ImageNet anyway, and they only did COCO detection.) Such simplicity might end up improving more and more downstream tasks. E.g., most GANs use residual blocks as their basic unit; I don’t know what they would look like built from Vanilla blocks, and the same goes for the UNets in most of the diffusion models.
The trend is also like Transformers winning against gated-RNN-based structures in NLP: a simpler structure permits more intense FLOPs.
Looking forward to other papers/experiments popping up around this one.
Ray’s more annoying afternotes
Hmmmm, that YMCA joke really cracked me up, but no.
The word `series` in “series activation” refers to the different grid/pixel inputs in the surrounding area; I can’t imagine it doing the YMCA thing. Maybe that’s because I failed to see it. In that sense it’s a little different from, say, ax^2 + bx^3 breaking the linearity (the YMCA pattern).
The Deep Learning Book explains non-linearity and its relationship with ReLU (in an early chapter of that book with the flower-covered cover): a linear layer can usually process most signals and implement most of the gates well (AND, OR, NOR, etc.), but there is no way it can perform an XOR gate. With ReLU, it can: ReLU breaks the pattern in which increasing the value of one input element can only push the output in one direction, hence the non-linearity. And non-linearity means most signals can enjoy a complete set of gates when combined with other signals.
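That XOR point can be made concrete with the classic hand-crafted network from the Deep Learning Book (Goodfellow et al., ch. 6): a single linear layer cannot compute XOR, but one ReLU hidden layer can. A minimal numpy sketch, with the book’s known hand-picked weights rather than learned ones:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# hand-crafted solution: two hidden units, then a linear readout
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def xor_net(x):
    h = relu(W1 @ x + b1)   # the ReLU bend makes the classes separable
    return w2 @ h

for x, target in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    assert xor_net(np.array(x, dtype=float)) == target
```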
So in this paper, my guess is that if x_i increases by delta_x, it can push y to be bigger or smaller, depending not on the scale of delta_x (per YMCA) but on its surroundings.
Btw good notes fellas, enjoy your narrations.
Sung’s Comments
Thanks Lara, Bruce, and Ray for the excellent notes and comments. I should have read this before.
I misunderstood: I thought the authors of the paper were ‘ranking’ non-linearity the way one ranks ‘smoothness’. A common example is
f(x) = x^2 sin(1/x) when x is nonzero, and f(0) = 0
This function f oscillates wildly near x = 0, but a nice function like x^2 ‘clamps’ down the oscillation enough that f is differentiable at x = 0. But f is obviously not as ‘smooth’ as the sine function at x = 0; in fact, the derivative of f is not even continuous at x = 0.
I thought the authors were emulating some exotic breakage by choosing some large n in the series of activation functions, a sort of even worse ‘break’. I did not expect that adding a handful of breaks (n = 2 or 3) and doing the YMCA would change much. I guess I was wrong, and NNs continue to surprise me.
Footnotes
Invariance means that the result doesn’t depend on where the cat is; an equivariant model looks everywhere for a cat in the same way but fires specifically where the cat is. A convolutional layer is actually equivariant; if we then aggregate over all positions, the result is invariant.↩︎